Goto

Collaborating Authors

 patch-based inference


Supplementary Material StreamNet: Memory-Efficient Streaming Tiny Deep Learning Inference on the Microcontroller Contents

Neural Information Processing Systems

However, TFLM's interpreter increases the performance overhead of the TinyML applications on MCUs. Unlike TFLM, StreamNet and MCUNetv2 replace the interpreter with a code generator. The system architecture of StreamNet contains the frontend and backend processing. Table 1 presents the data of StreamNet-2D. In Table 1, StreamNet achieves a geometric mean of 5.11X speedup TinyML models collected at the compile time to guide its auto-tuning framework.



StreamNet: Memory-Efficient Streaming Tiny Deep Learning Inference on the Microcontroller

Neural Information Processing Systems

With the emerging Tiny Machine Learning (TinyML) inference applications, there is a growing interest when deploying TinyML models on the low-power Microcontroller Unit (MCU). However, deploying TinyML models on MCUs reveals several challenges due to the MCU's resource constraints, such as small flash memory, tight SRAM memory budget, and slow CPU performance. Unlike typical layer-wise inference, patch-based inference reduces the peak usage of SRAM memory on MCUs by saving small patches rather than the entire tensor in the SRAM memory. However, the processing of patch-based inference tremendously increases the amount of MACs against the layer-wise method. Thus, this notoriously computational overhead makes patch-based inference undesirable on MCUs. This work designs StreamNet that employs the stream buffer to eliminate the redundant computation of patch-based inference. StreamNet uses 1D and 2D streaming processing and provides an parameter selection algorithm that automatically improve the performance of patch-based inference with minimal requirements on the MCU's SRAM memory space. In 10 TinyML models, StreamNet-2D achieves a geometric mean of 7.3X speedup and saves 81\% of MACs over the state-of-the-art patch-based inference.






StreamNet: Memory-Efficient Streaming Tiny Deep Learning Inference on the Microcontroller

Neural Information Processing Systems

With the emerging Tiny Machine Learning (TinyML) inference applications, there is a growing interest when deploying TinyML models on the low-power Microcontroller Unit (MCU). However, deploying TinyML models on MCUs reveals several challenges due to the MCU's resource constraints, such as small flash memory, tight SRAM memory budget, and slow CPU performance. Unlike typical layer-wise inference, patch-based inference reduces the peak usage of SRAM memory on MCUs by saving small patches rather than the entire tensor in the SRAM memory. However, the processing of patch-based inference tremendously increases the amount of MACs against the layer-wise method. Thus, this notoriously computational overhead makes patch-based inference undesirable on MCUs. This work designs StreamNet that employs the stream buffer to eliminate the redundant computation of patch-based inference.


Paper review: Patch-based inference for TinyML

#artificialintelligence

Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. Memory bottleneck exists with MCUs because of the imbalanced memory distribution in convolutional neural network (CNN) designs. For instance, in MobileNetV2 only the first 5 blocks have a high peak memory ( 450kB), becoming the memory bottleneck of the entire network. The remaining 13 blocks have a low memory usage, which can easily fit a 256kB MCU. The peak memory of the initial memory-intensive stage is 8 times higher than the rest of the network.